
Conversation

@simonrosenberg (Collaborator) commented on Jan 28, 2026

Summary

This PR aligns the default argument values in the benchmarks repository with the values used in the evaluation repository (OpenHands/evaluation).

Changes

Global defaults in args_parser.py:

These defaults are the same across all benchmarks and are set directly in args_parser.py:

| Argument | Default | Reason |
| --- | --- | --- |
| `--workspace` | `remote` | Production uses remote workspaces |
| `--max-iterations` | `500` | Sufficient for complex tasks |
| `--critic` | `finish_with_patch` | Ensures the agent produces valid patches |
| `--output-dir` | `./eval_outputs` | Standard output directory |
| `--n-limit` | `0` | No limit by default |
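
For orientation, here is a minimal sketch of how such shared defaults could be declared on the common parser. The flag names and values come from the table above; the `get_parser` name and the help strings are illustrative assumptions, not the actual PR diff:

```python
# Sketch only -- shared defaults on a common parser (as in benchmarks/utils/args_parser.py);
# helper name and help strings are assumptions based on the table above.
import argparse


def get_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Shared benchmark arguments")
    parser.add_argument("--workspace", default="remote", help="Production uses remote workspaces")
    parser.add_argument("--max-iterations", type=int, default=500, help="Iteration budget for complex tasks")
    parser.add_argument("--critic", default="finish_with_patch", help="Critic that checks the final patch")
    parser.add_argument("--output-dir", default="./eval_outputs", help="Standard output directory")
    parser.add_argument("--n-limit", type=int, default=0, help="0 means no instance limit")
    return parser
```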

Benchmark-specific defaults via config.py and parser.set_defaults():

Each benchmark has a config.py file with INFER_DEFAULTS (and optionally EVAL_DEFAULTS) that are applied via parser.set_defaults():

| Benchmark | INFER_DEFAULTS |
| --- | --- |
| commit0 | dataset, split, repo_split, num_workers=8, max_attempts=1, max_retries=1 |
| gaia | dataset, split=validation, num_workers=30, max_attempts=3 |
| swebench | dataset, split=test, num_workers=30, max_attempts=3, max_retries=3 |
| swebenchmultimodal | dataset, split=dev, num_workers=30, max_attempts=3, max_retries=3 |
| swtbench | dataset (SWT-bench), split=test, num_workers=30, max_attempts=3, max_retries=3 |

Evaluation defaults (EVAL_DEFAULTS):

| Benchmark | EVAL_DEFAULTS |
| --- | --- |
| swebench | dataset, workers=12 |
| swebenchmultimodal | dataset, split=dev, workers=12 |
| swtbench | dataset (SWE-bench), split=test, workers=24 |

Note: swtbench uses different datasets for inference (SWT-bench) vs evaluation (SWE-bench).
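
To illustrate the pattern, here is a minimal sketch of what a benchmark config and its application could look like, assembled from the SWE-bench values in the tables above and below; the module path, the exact key sets, and the parser helper name are assumptions rather than the literal PR diff:

```python
# Sketch only -- benchmarks/swebench/config.py as implied by the tables in this PR.
INFER_DEFAULTS = {
    "dataset": "princeton-nlp/SWE-bench_Verified",
    "split": "test",
    "num_workers": 30,
    "max_attempts": 3,
    "max_retries": 3,
}

EVAL_DEFAULTS = {
    "dataset": "princeton-nlp/SWE-bench_Verified",
    "workers": 12,
}

# In run_infer.py / eval_infer.py the benchmark overrides would be applied on top
# of the shared parser before parsing, roughly like:
#
#     parser = get_parser()                  # shared parser from args_parser.py (name assumed)
#     parser.set_defaults(**INFER_DEFAULTS)  # benchmark-specific defaults win over globals
#     args = parser.parse_args()
```

Because argparse parser-level defaults set via `set_defaults()` take precedence over argument-level defaults, each benchmark's config.py can override the shared values without redefining any arguments.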

Benefits

  • Consistency: Running benchmarks locally now uses the same defaults as production
  • Maintainability: Clear separation between global defaults (args_parser.py) and benchmark-specific defaults (config.py)
  • Single source of truth: Each benchmark's config.py is the authoritative source for its defaults

Testing

  • All modified files pass pre-commit checks (ruff format, ruff lint, pycodestyle, pyright)
  • No functional changes to evaluation logic, only default values

Update args_parser.py and benchmark-specific run_infer.py files to use
default values that match the evaluation repository (OpenHands/evaluation)
eval-job/values.yaml configuration.

Shared defaults updated in args_parser.py:
- workspace: 'docker' -> 'remote'
- max-iterations: 100 -> 500
- critic: 'pass' -> 'finish_with_patch'

Benchmark-specific overrides using parser.set_defaults():
- gaia: dataset='gaia-benchmark/GAIA'
- swtbench: dataset='eth-sri/SWT-bench_Verified_bm25_27k_zsp'
- commit0: max_attempts=1, max_retries=1 (in addition to existing dataset)

Also updated AGENTS.md to document the default values alignment pattern.

Co-authored-by: openhands <openhands@all-hands.dev>
…hmultimodal

- swebench: Add explicit set_defaults(dataset, split) for consistency with
  other benchmarks, even though values match global defaults
- swebenchmultimodal: Update comment to match the pattern used in other benchmarks

Co-authored-by: openhands <openhands@all-hands.dev>
Each benchmark now sets its own dataset default via set_defaults(),
so no global default is needed.

Co-authored-by: openhands <openhands@all-hands.dev>
All benchmarks in the evaluation repository use .llm_config/runtime.json
as the LLM config path, so use this as the default.

Co-authored-by: openhands <openhands@all-hands.dev>
- Created config.py with INFER_DEFAULTS and EVAL_DEFAULTS for each benchmark
- Removed all defaults from utils/args_parser.py
- Removed default from critics.py
- Updated swebench, gaia, swtbench run_infer.py to use INFER_DEFAULTS
- Started commit0 update (import added)

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai (bot) commented on Jan 29, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Pre-commit checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #377 at branch `align-default-values-with-evaluation`

Feel free to include any additional details that might help me get this PR into a better state.


@simonrosenberg (Collaborator, Author) commented:

@OpenHands fix precommit issues

@openhands-ai (bot) commented on Jan 29, 2026

I'm on it! simonrosenberg can track my progress at all-hands.dev

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai (bot) commented on Jan 29, 2026

I've fixed the pre-commit issues and pushed the changes to the PR branch.

Summary of Changes

The pre-commit checks were failing due to ruff lint errors related to import ordering. The fixes were:

  1. benchmarks/commit0/run_infer.py: Removed an unused import (from benchmarks.commit0.config import INFER_DEFAULTS)

  2. benchmarks/swebench/run_infer.py: Reordered imports to follow proper alphabetical ordering (moved from benchmarks.swebench.config import INFER_DEFAULTS after the build_images import block)

Verification

  • ✅ All pre-commit checks now pass (Ruff format, Ruff lint, PEP8 style check, Pyright type check)
  • ✅ Changes are minimal and only address the linting issues
  • ✅ Changes have been committed and pushed to the align-default-values-with-evaluation branch

The PR should now pass the pre-commit CI checks.


Align EVAL_DEFAULTS with NUM_EVAL_WORKERS from evaluation repository values.yaml:
- GAIA: workers=1
- Commit0: workers=1

Co-authored-by: openhands <openhands@all-hands.dev>
Update eval_infer.py files to import and use EVAL_DEFAULTS from their
respective config.py files via parser.set_defaults():
- swebench/eval_infer.py: uses EVAL_DEFAULTS for dataset, model_name, workers
- swtbench/eval_infer.py: uses EVAL_DEFAULTS for dataset, model_name, workers
- swebenchmultimodal/eval_infer.py: uses EVAL_DEFAULTS for dataset, split, model_name, workers

This ensures the default values defined in config.py are actually used
by the evaluation scripts, aligning with the pattern used in run_infer.py
files for INFER_DEFAULTS.

Co-authored-by: openhands <openhands@all-hands.dev>
…infer.py

Update run_infer.py files to import and use INFER_DEFAULTS from their
respective config.py files via parser.set_defaults():
- commit0/run_infer.py: uses INFER_DEFAULTS for all inference settings
- swebenchmultimodal/run_infer.py: uses INFER_DEFAULTS for all inference settings

This ensures the default values defined in config.py are actually used
by the inference scripts, completing the alignment with the evaluation
repository values.yaml.

Co-authored-by: openhands <openhands@all-hands.dev>
Update eval_infer.py files to import and use EVAL_DEFAULTS from their
respective config.py files via parser.set_defaults():
- commit0/eval_infer.py: uses EVAL_DEFAULTS for model_name
- gaia/eval_infer.py: uses EVAL_DEFAULTS for model_name

This ensures all benchmarks consistently use their config.py defaults.

Co-authored-by: openhands <openhands@all-hands.dev>
These fields are not benchmark-specific and should have global defaults:
- note: 'initial' (user-facing option for run identification)
- n_limit: 0 (no limit by default)
- output_dir: OUTPUT_DIR from constants.py ('./eval_outputs')

Added OUTPUT_DIR constant to benchmarks/utils/constants.py.

This keeps INFER_DEFAULTS focused on benchmark-specific values from
the evaluation repository's values.yaml.

Co-authored-by: openhands <openhands@all-hands.dev>
- gaia: Remove max_retries from INFER_DEFAULTS (not used in run_infer.py)
- gaia: Remove workers from EVAL_DEFAULTS (not used in eval_infer.py)
- commit0: Remove workers from EVAL_DEFAULTS (not used in eval_infer.py)

Each config now only contains fields that are actually used by the
corresponding run_infer.py and eval_infer.py scripts.

Co-authored-by: openhands <openhands@all-hands.dev>
Remove the default value 'initial' from --note argument. When not
specified, no note identifier is appended to the output directory.

The construct_eval_output_dir function already handles None/empty
values gracefully by not appending the _N_ suffix.

Co-authored-by: openhands <openhands@all-hands.dev>
Replace hardcoded dataset, split, and repo_split values with references
to INFER_DEFAULTS in:
- commit0/run_infer.py: Commit0Evaluation class __init__ and prepare_instances
- commit0/build_images.py: set only the specific defaults needed (dataset, split, repo_split)

This ensures all commit0 code uses the centralized config values.

Co-authored-by: openhands <openhands@all-hands.dev>
@simonrosenberg force-pushed the align-default-values-with-evaluation branch from 98a6a58 to e53928f on January 29, 2026 at 10:21
The commit0 eval_infer.py is a simple JSON processor that doesn't need
centralized defaults. Reverted to main version.

Co-authored-by: openhands <openhands@all-hands.dev>
The gaia eval_infer.py is a simple JSON processor that doesn't need
centralized defaults. Reverted to main version.

Co-authored-by: openhands <openhands@all-hands.dev>
Import DEFAULT_DATASET, DEFAULT_CLI_MODEL_NAME, DEFAULT_EVAL_WORKERS from
constants.py instead of duplicating the values. This ensures constants.py
remains the single source of truth for these values.

Co-authored-by: openhands <openhands@all-hands.dev>
… config.py

Remove these constants from constants.py and update eval_infer.py to use
EVAL_DEFAULTS from config.py instead. config.py is now the single source
of truth for dataset, model_name, and workers defaults.

Co-authored-by: openhands <openhands@all-hands.dev>
…VAL_DEFAULTS

model_name is specific to the CLI and should stay in constants.py.
EVAL_DEFAULTS now only contains dataset and workers.

Co-authored-by: openhands <openhands@all-hands.dev>
Revert eval_infer.py files to main and remove model_name from EVAL_DEFAULTS.
The model_name is hardcoded in the eval_infer.py files.

Co-authored-by: openhands <openhands@all-hands.dev>
…nd swtbench eval_infer

Import EVAL_DEFAULTS and use parser.set_defaults() to apply them.
model_name remains hardcoded in the argument parser.

Co-authored-by: openhands <openhands@all-hands.dev>
…d_eval_env_images

Update image_utils.py, build_eval_env_images.py, and eval_infer.py to import
and use INFER_DEFAULTS instead of hardcoding dataset and split values.

Co-authored-by: openhands <openhands@all-hands.dev>
…EVAL_DEFAULTS

image_utils.py and build_eval_env_images.py are used for evaluation, so they
should use EVAL_DEFAULTS (princeton-nlp/SWE-bench_Verified) not INFER_DEFAULTS
(eth-sri/SWT-bench_Verified_bm25_27k_zsp).

Added split='test' to EVAL_DEFAULTS to match values.yaml.

Co-authored-by: openhands <openhands@all-hands.dev>
Revert AGENTS.md to main version.
Restore original docstring example in build_images.py.

Co-authored-by: openhands <openhands@all-hands.dev>
Set workspace default='remote' in args_parser.py since it's the same for all
benchmarks. Remove workspace from all INFER_DEFAULTS in config.py files.

Co-authored-by: openhands <openhands@all-hands.dev>
…ULTS

Set max_iterations default=500 in args_parser.py since it's the same for all
benchmarks. Remove max_iterations from all INFER_DEFAULTS in config.py files.

Co-authored-by: openhands <openhands@all-hands.dev>
Set critic default='finish_with_patch' in critics.py since it's the same for
all benchmarks. Remove critic from all INFER_DEFAULTS in config.py files.

Co-authored-by: openhands <openhands@all-hands.dev>
Revert benchmarks/utils/constants.py to main version.
Hardcode './eval_outputs' as default for --output-dir in args_parser.py.

Co-authored-by: openhands <openhands@all-hands.dev>
@simonrosenberg marked this pull request as ready for review on January 29, 2026 at 11:32
@simonrosenberg self-assigned this on Jan 29, 2026
@simonrosenberg (Collaborator, Author) commented:

@OpenHands
Read this PR diff
This PR aligns the default argument values in the benchmarks repository with the values used in the evaluation repository (OpenHands/evaluation).

Then clone the evaluation repository and simplify its code as much as possible. Now that the correct values are defined in the benchmarks repository, they can be removed from the evaluation repository entirely.
Try to simplify and clean up the evaluation repo code and open a PR with your changes.

@openhands-ai (bot) commented on Jan 29, 2026

I'm on it! simonrosenberg can track my progress at all-hands.dev

Add level to GAIA INFER_DEFAULTS matching production configuration.
Make --level argument optional since it now has a default.

Co-authored-by: openhands <openhands@all-hands.dev>
@simonrosenberg (Collaborator, Author) commented:

@OpenHands
/codereview-roasted
Clone the evaluation repository.
Check out the values in eval-job/values.yaml.
The purpose of this PR is to move all those hyperparameters from the evaluation repo to the benchmarks repository.

Please validate that

  1. no hyperparameter has changed values / no behavioral changes are introduced
  2. the logic of this PR is sound

@openhands-ai (bot) commented on Jan 29, 2026

I'm on it! simonrosenberg can track my progress at all-hands.dev

@openhands-ai (bot) commented on Jan 29, 2026

Summary

I've completed the task of simplifying the OpenHands/evaluation repository to leverage the default values now defined in the OpenHands/benchmarks repository.

Changes Made

Created PR: https://github.com/OpenHands/evaluation/pull/223 (draft)

Files Modified:

  1. eval-job/values.yaml - Removed redundant inference/evaluation defaults:

    • Removed: WORKSPACE_TYPE, CRITIC_NAME, NUM_INFER_WORKERS, MAX_ITERATIONS, MAX_ATTEMPTS, MAX_RETRIES, OUTPUT_DIR
    • Kept: DATASET, DATASET_SPLIT, REPO_SPLIT, DATASET_INFER, DATASET_EVAL (needed for build workflow dispatch)
  2. run_swebench.sh - Simplified to only pass run-specific arguments (EVAL_LIMIT, MODEL_SLUG, MODEL_NAME)

  3. run_gaia.sh - Simplified to only pass run-specific and GAIA-specific arguments (GAIA_LEVEL, GAIA_SPLIT, EVAL_LIMIT, MODEL_SLUG)

  4. run_commit0.sh - Simplified to only pass run-specific arguments

  5. run_swtbench.sh - Simplified to only pass run-specific arguments

  6. run_swebenchmultimodal.sh - Simplified to only pass run-specific arguments

  7. run_eval.sh - Removed benchmark-specific environment variable requirements that are now handled by benchmarks defaults

Benefits

  • Reduced duplication: Default values are now defined in one place (benchmarks repo)
  • Simpler scripts: Shell scripts are shorter and easier to understand (~70 lines removed)
  • Easier maintenance: Changes to defaults only need to be made in the benchmarks repo
  • Consistency: Running benchmarks locally uses the same defaults as production

Dependency

This PR depends on OpenHands/benchmarks PR #377 which aligns the default argument values in the benchmarks repository. The evaluation PR should be merged after the benchmarks PR is merged.


@openhands-ai (bot) commented on Jan 29, 2026

Code Review: PR #377 - Align Default Argument Values with Evaluation Repository

Taste Rating: 🟡 Acceptable - Works but has one critical gap


Hyperparameter Validation Summary

I compared the PR's config.py files against the evaluation repository's eval-job/values.yaml. Here's the validation:

✅ Values That Match Correctly

| Benchmark | Parameter | PR Value | Eval Repo Value |
| --- | --- | --- | --- |
| Global | workspace | remote | WORKSPACE_TYPE: remote |
| Global | max-iterations | 500 | MAX_ITERATIONS: 500 |
| Global | critic | finish_with_patch | CRITIC_NAME: finish_with_patch |
| Global | output-dir | ./eval_outputs | OUTPUT_DIR: ./eval_outputs |
| SWE-bench | dataset | princeton-nlp/SWE-bench_Verified | |
| SWE-bench | split | test | |
| SWE-bench | num_workers | 30 | |
| SWE-bench | max_attempts | 3 | |
| SWE-bench | max_retries | 3 | |
| SWE-bench | eval workers | 12 | |
| SWT-bench | dataset (infer) | eth-sri/SWT-bench_Verified_bm25_27k_zsp | |
| SWT-bench | dataset (eval) | princeton-nlp/SWE-bench_Verified | |
| SWT-bench | eval workers | 24 | |
| Commit0 | dataset | wentingzhao/commit0_combined | |
| Commit0 | repo_split | lite | |
| Commit0 | num_workers | 8 | |
| Commit0 | max_attempts | 1 | |
| Commit0 | max_retries | 1 | |
| SWE-bench MM | dataset | princeton-nlp/SWE-bench_Multimodal | |
| SWE-bench MM | split | dev | |
| GAIA | dataset | gaia-benchmark/GAIA | |
| GAIA | split | validation | |
| GAIA | level | 2023_all | |
| GAIA | num_workers | 30 | |
| GAIA | max_attempts | 3 | |

[CRITICAL ISSUES] - Must Fix

🔴 [benchmarks/gaia/config.py] Missing max_retries

GAIA's INFER_DEFAULTS is missing max_retries, but the evaluation repo specifies MAX_RETRIES: "3" for GAIA.

Impact: args_parser.py no longer has a default for max_retries. When gaia/build_images.py passes args.max_retries (which will be None) to build_all_images(), it will crash with:

```
assert max_retries >= 1, "max_retries must be at least 1"
TypeError: '>=' not supported between instances of 'NoneType' and 'int'
```

Fix Required:

```python
# benchmarks/gaia/config.py
INFER_DEFAULTS = {
    "dataset": "gaia-benchmark/GAIA",
    "split": "validation",
    "level": "2023_all",
    "num_workers": 30,
    "max_attempts": 3,
    "max_retries": 3,  # ADD THIS LINE
}
```

🔴 [Multiple build_images.py files] Build scripts don't set max_retries defaults

The following build scripts use args.max_retries but don't call parser.set_defaults() with max_retries:

  • swebench/build_images.py
  • swebenchmultimodal/build_images.py
  • gaia/build_images.py
  • commit0/build_images.py (only sets dataset, split, repo_split)
  • swtbench/build_eval_env_images.py (only sets dataset, split)

Impact: All build scripts will crash when max_retries=None is passed.

Fix Options:

  1. Add max_retries to each benchmark's INFER_DEFAULTS and use parser.set_defaults(**INFER_DEFAULTS) in build scripts (see the sketch after this list)
  2. Or restore a sensible default (e.g., 3) in args_parser.py for max_retries
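
A self-contained sketch of option 1 using the GAIA values; the argument set is deliberately reduced for illustration, and only the `parser.set_defaults(**INFER_DEFAULTS)` call reflects the actual proposal:

```python
# Sketch only: a build script applying the benchmark's INFER_DEFAULTS, which would
# then need to carry max_retries so the build path never sees None.
import argparse

INFER_DEFAULTS = {
    "dataset": "gaia-benchmark/GAIA",
    "split": "validation",
    "max_retries": 3,  # the key the review flags as missing from gaia's config
}

parser = argparse.ArgumentParser()
parser.add_argument("--dataset")
parser.add_argument("--split")
parser.add_argument("--max-retries", type=int)
parser.set_defaults(**INFER_DEFAULTS)
args = parser.parse_args([])

# With the defaults applied, the guard in the build path compares an int, not None:
assert args.max_retries >= 1, "max_retries must be at least 1"
```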

[IMPROVEMENT OPPORTUNITIES]

🟡 [benchmarks/utils/args_parser.py, Line 80-84] Inconsistent None handling

For run_infer.py scripts, max_retries=None is silently converted to 3 by Pydantic's EvalMetadata model default. For build_images.py scripts, it causes a crash. This inconsistency is confusing.

Suggestion: Either restore a global default in args_parser.py, or ensure all config.py files include max_retries.


VERDICT

Needs rework - The GAIA max_retries omission and build script defaults must be addressed before merging.

KEY INSIGHT: The PR correctly aligns inference defaults but overlooks that build scripts also inherit from args_parser.py and need max_retries to be set somewhere.


Logic Soundness Assessment

The overall architecture is sound:

  • ✅ Centralizing defaults in config.py per benchmark is a good pattern
  • ✅ Using parser.set_defaults() is the correct approach
  • ✅ Separating INFER_DEFAULTS and EVAL_DEFAULTS makes sense
  • ⚠️ The gap is that build scripts weren't updated to use the new pattern consistently


@simonrosenberg (Collaborator, Author) commented:

@juanmichelini commit0 was set to max_retries = max_attempts = 1 because commit0 runs are extremely long and it would always rerun 3 times (because back then the finish_with_patch critic was incorrectly marking everything as false).
Perhaps we should set both values to 3?

@simonrosenberg force-pushed the align-default-values-with-evaluation branch from f27ac88 to d7be118 on January 29, 2026 at 12:17
- Keep default=3 for max_attempts and max_retries in args_parser.py
- Remove redundant max_attempts=3 and max_retries=3 from config.py files
  (gaia, swebench, swebenchmultimodal, swtbench) since they match the default
- Keep max_attempts=1 and max_retries=1 in commit0/config.py since it differs
  from the default
- Remove max_retries from commit0/build_images.py set_defaults (uses global default)

Co-authored-by: openhands <openhands@all-hands.dev>
@simonrosenberg force-pushed the align-default-values-with-evaluation branch from d7be118 to 19be07f on January 29, 2026 at 12:26
@juanmichelini (Collaborator) commented:

> commit0 was set to max_retries = max_attempts = 1 because commit0 runs are extremely long and it would always rerun 3 times (because back then the finish_with_patch critic was incorrectly marking everything as false).
>
> Perhaps we should set both values to 3?

max_retries should always be 3 since it's there to mitigate infra errors. max_attempts, on the other hand, can be set to 1.

@juanmichelini (Collaborator) commented:

@simonrosenberg why does commit0 only have 8 workers?

@juanmichelini self-requested a review on January 29, 2026 at 12:58
@simonrosenberg (Collaborator, Author) commented:

> why does commit0 only have 8 workers?

It was set to 8 at the start because 16 seemed like too much to me. But 16 is actually nothing, so it should be bumped to 16.
